We show that the quality of adaptive differential PCM (ADPCM)
speech can be significantly improved by passing it through a
reconstruction low-pass filter that is matched to an appropriately
defined short-time speech cutoff frequency. In practice, the adaptive
procedure switches the decoder output into one of a bank of N
low-pass filters whose cutoff frequencies span the expected range of
input speech bandwidth. For equally spaced filter cutoffs, and with
uniform probability density function models for the quantization
noise spectrum and the cutoff frequency, more than one-half of the
maximum adaptive filtering gain is realizable with a bank of four
filters. Computer simulations of 16- and 24-kilobit/s ADPCM coders
indicate that perceived quality gains are in fact greater than the
analytically predicted objective gain of 2.6 dB suggests.

I. SHORT-TIME CUTOFF FREQUENCY 

ADPCM (adaptive-quantization/differential PCM) speech coding usually
assumes a time-invariant model for speech bandwidth (such as 3200 Hz
for telephone-quality applications), and a corresponding
time-invariant low-pass filter with cutoff f_0 = 3200 Hz for the
decoded speech. However, short-time spectra of 3200-Hz-filtered
speech exhibit cutoff frequencies f_c that are often significantly
smaller than the long-time nominal cutoff frequency f_0 (Ref. 1).
Figure 1 sketches a short-time spectrum at time t, and defines
f_c(t) = f_L(t) as the low-pass cutoff frequency that includes all
but T percent of short-time spectral energy. Figure 2 shows
long-time-averaged spectra for four sentence-length, 3200-Hz
band-limited utterances L, B, C, and D, denoting "A lathe is a big
tool" (female utterance), "An icy wind raked the beach" (female
utterance), "The chairman cast three votes" (female utterance), and
"This is a computer test of a digital speech coder" (male
utterance), respectively. All four spectra are clearly low pass in a
long-time-average sense, a fact that is well exploited in
fixed-prediction differential coding (Ref. 2).

Fig. 1 - Definition of short-time cutoff frequency f_L(t). The shaded
area includes T percent of short-time speech power.

Figure 3 shows corresponding histograms of short-time cutoff
frequency f_L(t) for a threshold T = 1 percent. On a short-time
basis, speech segments can be either low pass (say, f_L(t) < f_0/2)
or all pass (say, f_L(t) > f_0/2), although both segment types come
from inputs that are low pass from a long-time-averaged energy
viewpoint. The four histograms of Fig. 3 are clearly very different;
however, as a single descriptor of these histograms, we propose a
uniform probability density model for f_L(t),

$$
p\bigl(f_L(t)\bigr) = \frac{1}{f_0}, \qquad 0 < f_L(t) < f_0. \tag{1}
$$

The adequacy of the above model is clearly a function of the
threshold T. For the extreme cases of T = 100 percent and T = 0
percent, the pdf of f_L(t) would degenerate into delta functions at
f = 0 and f = f_0, respectively, with corresponding low-pass counts
of 100 and 0 percent. The uniform density model (1), on the other
hand, implies equal, 50 percent occurrences of low-pass and all-pass
segments, and Fig. 4 shows that this is reasonable as a
sentence-ensemble average for thresholds T = 1 and 2 percent. A
threshold of T = 1 percent has been used in all of our ADPCM
simulations.

The value of T = 1 percent produces a filtering distortion of
10 log(1/100) = -20 dB. More relevantly, it constitutes a
"perceptually acceptable" low-pass threshold, with a filtering
distortion that is obvious only in critical listening. At T = 2
percent, on the other hand, the filtering distortion begins to be
obvious, and is undesirable even for low-quality applications such
as 16-kilobit/s coding. The role of the threshold T is discussed at
greater length in a recent work on the refinement of ADM (adaptive
delta modulation) with adaptive post-filtering (Ref. 1). That work
also shows that, for purposes of reducing coder noise, consideration
of the low-pass threshold f_L(t) is more useful than consideration
of a high-pass threshold f_H(t); in other words, providing a
band-pass reconstruction filter matched to the short-time high-pass
and low-pass cutoffs [f_H(t), f_L(t)] was not significantly better
than providing a low-pass reconstruction filter matched to the range
[0, f_L(t)].
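
The threshold definition of Section I can be written out as a worked
equation. This is only a sketch: the symbol S(f, t) for the
short-time power spectrum is introduced here for illustration, and
taking the total energy over the band [0, f_0] is an assumption based
on the inputs being band-limited to f_0.

```latex
% Definition of the short-time cutoff f_L(t) for a threshold of T percent
% (S(f,t) is an illustrative symbol, not one used in the text):
\int_0^{f_L(t)} S(f,t)\,df = \left(1 - \frac{T}{100}\right) \int_0^{f_0} S(f,t)\,df

% Filtering distortion: rejected energy relative to total energy, in dB
D(T) = 10 \log_{10} \frac{T}{100}, \qquad
D(1) = -20\ \text{dB}, \qquad
D(2) = 10 \log_{10} 0.02 \approx -17\ \text{dB}
```

Going from T = 1 to T = 2 percent thus raises the filtering
distortion by about 3 dB.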




Fig. 2 - Long-time averaged spectra for four sentence-length inputs
(horizontal axis: frequency in kilohertz). A 3.2-kHz band limitation
is obvious in all four examples.






Fig. 3 - Histograms of short-time cutoff frequency f_L(t) (T = 1
percent) for the four inputs of Fig. 2 (horizontal axis: f_L/f_0,
from 0 to 1.0). The short-time cutoff is in general significantly
less than 3.2 kHz, the cutoff in the long-time-averaged spectra of
Fig. 2.



II. ADAPTIVE LOW-PASS FILTERS 

Figure 5a shows the effect of a low-pass reconstruction filter in
ideal adaptive post-filtering; out-of-band ADPCM quantization noise
components in the cross-hatched range [f_L(t), f_0] of the noise
spectrum are rejected by an ideal low-pass filter matched to f_L(t).

Fig. 4 - Percentages of segments with f_L(t) < f_0/2 in the four
inputs of Fig. 2. For T = 1 and 2 percent, the mean percentage of
such segments is on the order of 50 percent.

Figures 5b and 5c show suboptimal but practical versions of Fig. 5a,
where the value of f_L(t) switches the decoder output into one of a
bank of N low-pass filters, leading to a degree of noise rejection
that is always less than that in Fig. 5a. For example, in the N = 2
example of Fig. 5b, the filter bank consists of two filters with
fixed cutoffs f_c2 (= f_0) and f_c1. In the upper illustration in
Fig. 5b, the value of f_L(t) is not small enough to switch in the
lower filter f_c1; consequently, there is no out-of-band noise
rejection similar to that in the upper example of Fig. 5a. With the
uniform pdf model (1), the two-filter bank system realizes
out-of-band noise rejection only 50 percent of the time (when
f_L(t) < f_c1 = f_0/2). The four-filter system of Fig. 5c is clearly
more effective; it realizes nonzero noise rejection in three out of
the four cases shown, and indeed for 75 percent of all speech
segments, if the uniform pdf (1) is valid.

A block diagram of an N-filter-bank adaptive system appears in Fig.
6. Note that for simplicity, the cutoff frequencies are equally spaced,

$$
f_{cn} = f_0 \, \frac{n}{N}, \tag{2}
$$

and that filter n is switched on when the input frequency cutoff is in
the appropriate (f_0/N)-wide range,

$$
\text{Switch to filter } n \text{ if } \quad \frac{n-1}{N} < \frac{f_L(t)}{f_0} < \frac{n}{N}. \tag{3}
$$
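
The switching rule (3) and the equally spaced cutoffs (2) can be
sketched in a few lines of Python. The function names and default
values (f_0 = 3200 Hz, N = 4) are illustrative assumptions, as is the
choice to assign a cutoff that falls exactly on a boundary to the
lower filter; the text does not say how boundaries are handled.

```python
import math


def select_filter(f_l_hz: float, f0_hz: float = 3200.0, n_filters: int = 4) -> int:
    """Return the 1-based index n of the filter chosen by rule (3).

    Assumption: a cutoff lying exactly on a boundary n*f0/N is
    assigned to filter n (the lower of the two candidates).
    """
    if not 0.0 < f_l_hz <= f0_hz:
        raise ValueError("f_L must lie in (0, f0]")
    n = math.ceil(n_filters * f_l_hz / f0_hz)
    # Clamp to guard against floating-point rounding at the band edges.
    return min(max(n, 1), n_filters)


def filter_cutoff_hz(n: int, f0_hz: float = 3200.0, n_filters: int = 4) -> float:
    """Cutoff frequency of filter n according to Eq. (2)."""
    return f0_hz * n / n_filters
```

For instance, with the defaults above a segment with f_L(t) = 1000 Hz
selects filter 2, whose cutoff is 1600 Hz.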



III. SEGMENTAL s/n GAINS G(N) 

The spectrum of quantization noise depends on many factors,
including the nature of adaptive quantization, the input spectrum,
and the effect of the predictor; but experience indicates that the
white-noise spectrum of Fig. 5 is indeed the single most reasonable
model (Refs. 1, 3). Combining this assumption with the uniform pdf
for the cutoff frequency f_L(t), we shall now develop expressions for
expected gains in segmental s/n (Ref. 2) due to adaptive
post-filtering.

Fig. 6 - Block diagram of adaptive low-pass filtering system with an
N-filter bank. Allowed cutoff frequencies are equally spaced, being
integral multiples of f_0/N. The entire dashed box constitutes an
optional refinement of a conventional ADPCM decoder.

When the input cutoff f_L(t) is such as to switch in filter n (n =
1, 2, ..., N), the noise rejection factor is N/n (see Fig. 5), with a
maximum value of N for extremely low values of f_L(t), and a
minimum value of 1 for extremely high values of f_L(t). The expected
gain (in dB) is therefore given by

$$
G(N) = \sum_{n=1}^{N} \left( 10 \log \frac{N}{n} \right) \Pr[n] \ \text{dB}, \tag{4}
$$



where the probability of switching in filter n is Pr [n] = l/N, a result 
of the uniform pdf (1). The filtering gain G(N) can be simply rewritten 
in the form 



$$
G(N) = 10 \log N - \frac{10}{N} \sum_{n=1}^{N} \log n. \tag{5}
$$
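
As a numerical check, the gain can be evaluated both from the sum in
(4) with Pr[n] = 1/N and from the rewritten form (5). The sketch
below assumes base-10 logarithms, as implied by the dB units; the
function names are illustrative.

```python
import math


def gain_db_sum(n_filters: int) -> float:
    """Expected gain from Eq. (4) with Pr[n] = 1/N."""
    return sum(10.0 * math.log10(n_filters / n)
               for n in range(1, n_filters + 1)) / n_filters


def gain_db_closed(n_filters: int) -> float:
    """Expected gain from the rewritten form, Eq. (5)."""
    log_sum = sum(math.log10(n) for n in range(1, n_filters + 1))
    return 10.0 * math.log10(n_filters) - 10.0 * log_sum / n_filters


if __name__ == "__main__":
    for n_filters in (1, 2, 4, 8, 16):
        print(n_filters, round(gain_db_closed(n_filters), 2))
```

The two forms agree, and they give about 1.5 dB for N = 2 and about
2.6 dB for N = 4, matching the values quoted in the text.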



Fig. 7 - Segmental s/n gain G(N) versus N. This characteristic
assumes uniform pdf models for coding noise and f_L(t). The
asymptotic value is G(∞) = 4.35 dB. More than half this gain is
realized with N = 4.

Figure 7 plots G(N) as a function of N. The asymptotic value

$$
G(\infty) = \int_0^{f_0} 10 \log \frac{f_0}{f_L(t)} \; p\bigl(f_L(t)\bigr) \, df_L(t) \tag{6}
$$

can be evaluated simply by using the identity

$$
\int \ln z \, dz = z \ln z - z,
$$

and this results in

$$
G(\infty) = 10 \log e = 4.35 \ \text{dB}. \tag{7}
$$
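
The intermediate steps between (6) and (7) can be filled in as
follows. This is a sketch that assumes base-10 logarithms for the dB
quantities and uses the substitution z = f_L(t)/f_0, which is
introduced here for convenience.

```latex
\begin{aligned}
G(\infty)
  &= \int_0^{f_0} 10 \log_{10}\frac{f_0}{f_L}\,\frac{1}{f_0}\,df_L
   = -10 \int_0^1 \log_{10} z \, dz
   && \text{(uniform pdf (1), } z = f_L/f_0\text{)} \\
  &= -\frac{10}{\ln 10}\,\bigl[\, z \ln z - z \,\bigr]_0^1
   = -\frac{10}{\ln 10}\,(-1)
   && \text{(since } z \ln z \to 0 \text{ as } z \to 0\text{)} \\
  &= \frac{10}{\ln 10} = 10 \log_{10} e \approx 4.34\ \text{dB}.
\end{aligned}
```

Evaluated to three significant figures the result is 4.34 dB, which
the text quotes as 4.35 dB.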



This asymptotic value is indeed close to the ideal adaptive filtering
gains reported in the earlier-cited ADM experiments at several bit
rates (Ref. 1). It is also seen from Fig. 7 that the four-filter bank
method of Fig. 5c theoretically realizes more than one-half of the
maximum possible dB gain G(∞) in segmental s/n: G(4) ≈ 2.6 dB.

The above analytical formulation can also be used to assess the
efficiency of the equispaced filter design in (2) and Fig. 6. For
illustration, G(2) = 1.5 dB with N = 2. It can be shown analytically
that an optimal design, one that maximizes adaptive filtering gain
for N = 2, is one for which f_c1 = f_0/e, rather than f_0/2. Figure 8
plots the theoretical expected gain in segmental s/n as a function of
f_c1. The maximum gain is 1.6 dB, only 0.1 dB better than that of the
simple design of (2), which suggests f_c1 = f_0/2 for N = 2.
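
The N = 2 optimum can be checked numerically. Under the uniform pdf
(1) and the white-noise model, the lower filter with cutoff
f_c1 = x f_0 is switched in with probability x and rejects noise by a
factor 1/x, so the expected gain is x * 10 log(1/x) dB. The sketch
below sweeps x on a grid; the function names and the grid resolution
are illustrative assumptions.

```python
import math


def two_filter_gain_db(x: float) -> float:
    """Expected gain (dB) for N = 2 with lower cutoff f_c1 = x * f_0, 0 < x <= 1."""
    return x * 10.0 * math.log10(1.0 / x)


def best_cutoff_ratio(steps: int = 10000) -> float:
    """Grid search for the ratio f_c1/f_0 that maximizes the expected gain."""
    candidates = [i / steps for i in range(1, steps + 1)]
    return max(candidates, key=two_filter_gain_db)


if __name__ == "__main__":
    x_opt = best_cutoff_ratio()
    print(x_opt, two_filter_gain_db(x_opt), two_filter_gain_db(0.5))
```

The sweep peaks near x = 1/e with a gain of about 1.6 dB, against
about 1.5 dB at x = 0.5, in line with the 0.1-dB difference stated
above.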

IV. RESULTS OF COMPUTER SIMULATIONS 

ADPCM coders with two- and four-filter bank systems were simulated
with the speech inputs mentioned in Section I. The coder bit rates
were 24 and 16 kilobit/s, corresponding to 3- and 2-bit/sample
quantizers and 8-kHz sampling of the inputs. The quantizers were
adaptive. Both backward-adaptive (AQB) and forward-adaptive (AQF)
quantizers were simulated (Ref. 4); filtering gains were evident in
both cases, but in terms of absolute quality, the AQF coders were
clearly better, especially at 16 kilobit/s. AQF coders, however,
require the explicit transmission of step-size information, using,
for example, four bits once for every segment of 256 samples (16 ms).

Fig. 8 - Segmental s/n gain with N = 2 as a function of f_c1/f_0. The
maximum value of 1.6 dB occurs for f_c1/f_0 = e^{-1}. The scheme of
Fig. 6 suggests f_c1/f_0 = 0.5 for N = 2. This produces an s/n gain
only 0.1 dB less than the maximum value of 1.6 dB.

The cutoff frequency f_L(t) was computed once for every segment of
256 samples. These segments were Hamming-windowed, zero-padded
for better frequency resolution, and analyzed by means of a
512-sample FFT.
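
The cutoff computation just described might be sketched as follows.
This is a minimal illustration and not the simulation code: it uses a
direct DFT from the Python standard library in place of an FFT
routine, measures total energy over the band [0, f_0] (an assumption),
and the function name and default arguments are hypothetical.

```python
import cmath
import math


def short_time_cutoff_hz(segment, fs_hz=8000.0, n_fft=512,
                         t_percent=1.0, f0_hz=3200.0):
    """Estimate f_L(t): lowest frequency below which all but T percent
    of the segment's spectral energy in [0, f0] lies."""
    n = len(segment)
    # Hamming window over the n input samples.
    windowed = [s * (0.54 - 0.46 * math.cos(2.0 * math.pi * i / (n - 1)))
                for i, s in enumerate(segment)]
    # Zero-padding to n_fft is implicit: only n samples are nonzero.
    k_max = int(f0_hz * n_fft / fs_hz)
    power = []
    for k in range(k_max + 1):
        step = -2j * math.pi * k / n_fft
        x_k = sum(v * cmath.exp(step * i) for i, v in enumerate(windowed))
        power.append(abs(x_k) ** 2)
    total = sum(power)
    if total == 0.0:
        return 0.0  # silent segment: no energy to threshold
    target = (1.0 - t_percent / 100.0) * total
    running = 0.0
    for k, p in enumerate(power):
        running += p
        if running >= target:
            return k * fs_hz / n_fft
    return f0_hz  # fallback for floating-point rounding
```

With a 512-point transform at 8-kHz sampling, the estimate is
quantized in steps of 8000/512 = 15.625 Hz.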

The filter banks consisted of 33-point FIR filters whose frequency
responses are shown in Fig. 9. Although gains due to a two-filter
system were almost always noticeable, they were not always
significant; specifically, they were not significant for inputs with
predominantly all-pass (f_L(t) > f_0/2) segments; see Figs. 3 and 4.
A four-filter bank, on the other hand, provided significant quality
improvement with all four test inputs. The perceived improvement in
quality was much greater in all cases than what the theoretically
predicted objective gain of G(4) = 2.6 dB indicates. The measured
gains were very input-dependent, with an average value slightly less
than the theoretical value of G(4). This result reinforces earlier
results for time-varying filtering of ADM speech (Ref. 1), where
again perceptual gains were in excess of objectively measured
segmental s/n improvements; this is probably because the residual
in-band noise after adaptive filtering is much less annoying, being
masked by the input signal. Adaptive-bandwidth ADPCM, of course, has
the additional possibility of variable bit allocation. For example,
significant quality improvements have been noticed in a system where
the input was subsampled at 5.33 kHz whenever f_L(t) < 2.4 kHz, and
more quantization bits were allocated for subsampled segments. For
example, in 16-kilobit/s operation these segments were coded with
three bits/sample instead of two (5.33 kHz x 3 bits/sample ≈ 8 kHz x
2 bits/sample = 16 kilobit/s). This type of variable bit allocation
is ruled out by definition in ADM, which is a one-bit/sample system.
Notice finally that ADPCM with out-of-band noise rejection and
variable bit allocation is very similar to frequency-domain subband
coding (Ref. 2).

Fig. 9 - Frequency responses of low-pass filters in an N = 4 filter
bank. Each filter is a 33-point FIR design, and the cutoffs
correspond to those in Fig. 5c and Fig. 6, with f_0 = 3.2 kHz.
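
A bank of 33-point low-pass FIR filters with the cutoffs of (2) could
be generated along the following lines. The text does not state how
the filters were designed; the Hamming-windowed sinc method used here
is an assumption chosen for simplicity, and all function names are
illustrative.

```python
import math


def lowpass_fir(cutoff_hz, fs_hz=8000.0, n_taps=33):
    """Hamming-windowed sinc low-pass filter, normalized to unity gain at DC.

    Assumption: windowed-sinc design; the original design method is not given.
    """
    mid = (n_taps - 1) / 2.0
    fc = cutoff_hz / fs_hz  # cutoff in cycles per sample
    taps = []
    for i in range(n_taps):
        m = i - mid
        ideal = 2.0 * fc if m == 0 else math.sin(2.0 * math.pi * fc * m) / (math.pi * m)
        window = 0.54 - 0.46 * math.cos(2.0 * math.pi * i / (n_taps - 1))
        taps.append(ideal * window)
    dc_gain = sum(taps)
    return [t / dc_gain for t in taps]


def filter_bank(f0_hz=3200.0, n_filters=4, fs_hz=8000.0, n_taps=33):
    """Filters with equally spaced cutoffs f0 * n / N, n = 1..N, per Eq. (2)."""
    return [lowpass_fir(f0_hz * n / n_filters, fs_hz, n_taps)
            for n in range(1, n_filters + 1)]


def magnitude_at(taps, freq_hz, fs_hz=8000.0):
    """Magnitude of the filter's frequency response at freq_hz."""
    w = 2.0 * math.pi * freq_hz / fs_hz
    re = sum(t * math.cos(w * i) for i, t in enumerate(taps))
    im = sum(t * math.sin(w * i) for i, t in enumerate(taps))
    return math.hypot(re, im)
```

With f_0 = 3.2 kHz and N = 4 the cutoffs are 0.8, 1.6, 2.4, and 3.2
kHz; a decoder would convolve each segment with the filter selected
for that segment's f_L(t).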

V. CONCLUSIONS 

We have indicated a conceptually simple but very effective
procedure for improving the quality of ADPCM speech coding. Very
little extra information is needed to effect this improvement; with
a four-filter design it would be log2 4 = two bits, corresponding to
four possible ranges of the cutoff frequency f_L(t). The complexity
of computing f_L(t) is quite significant in relation to the
simplicity of the conventional, basic ADPCM coder. The increased
complexity, however, will be relatively less objectionable in voice
storage applications than in transmission systems; and in either
case, an attractive feature is that the post-filtering procedure
(the boxed portion of Fig. 6) can be used as an entirely optional
refinement. Also, as noted earlier, the adaptive post-filtering
technique discussed in this paper can be used with significant gains
in the context of coders other than ADPCM (Ref. 1).